Data refining engine for high performance analysis system and method

ABSTRACT

Price and product attributes from webpages are imported, indexed, analyzed, and made available to be searched in close to real time, allowing searches for price changes specific to products on individual webpages and for products across all webpages, as well as identification of longitudinal correlations between price changes and product attributes. Users may search the data and set alerts.

CROSS-REFERENCE TO AND INCORPORATION BY REFERENCE OF RELATED APPLICATIONS

This application claims the benefit of and incorporates by reference U.S. Provisional Patent Application No. 61/675,492, filed on Jul. 25, 2012. This application is a continuation-in-part of and also claims the benefit of and incorporates by reference issued U.S. patent application Ser. No. 13/951,244, filed on Jul. 25, 2013, issued as U.S. Pat. No. 9,047,614, and titled, “Adaptive Gathering of Structured and Unstructured Data System and Method,” U.S. patent application Ser. No. 14/726,707, filed on Jun. 1, 2015, titled, “Adaptive Gathering of Structured and Unstructured Data System and Method,” and U.S. patent application Ser. No. 13/951,248, filed on Jul. 25, 2013, titled, “Data Refining Engine for High Performance Analysis System and Method.”

FIELD

This disclosure relates to a method and system to analyze price and product information.

BACKGROUND

The following description includes information that may be useful in understanding the present invention. It is not an admission that any of the information provided herein is prior art or relevant to the presently claimed invention, or that any publication specifically or implicitly referenced is prior art.

Search engines, such as Google, Bing, and others search and index vast quantities of information on the Internet. “Crawlers” (a.k.a. “spiders”) utilize URIs obtained from a “queue” to obtain content, usually from webpages. The crawlers or other software store and index some of the content. Users can then search the indexed content, view results, and follow hyperlinks back to the original source or to the stored content (the stored content often being referred to as a “cache”). Computing resources to crawl and index, however, are not limitless. The URI queues are commonly prioritized to direct crawler resources to web page servers which can accommodate the traffic, which do not block crawlers (such as according to “robots.txt” files commonly available from webpage servers), which experience greater traffic from users, and which experience more change in content.

Conventional search engines, however, are not focused on price and product information. If a price changes on a webpage, but the rest of the webpage remains the same, traditional crawlers (or the queue manager) will not prioritize the webpage position in the queue, generally because the price is a tiny fraction of the overall content and the change is not labeled as being significant; conversely, if the webpage changes, but the price and/or product information remains the same, the change in webpage content may cause a traditional crawler to prioritize the webpage position in the queue due to the overall change in content, notwithstanding that the price and product information remained the same.

Conventional search engines, if presented with a query, will find corresponding products. For example, it is possible to search for “men's shoes” and to then be presented with a webpage comprising search results for hundreds of thousands of webpages for men's shoes. The search result may further be narrowed by category of men's shoes, brand, and store. Search engines have been incorporated into online stores, wherein a user may search for products, by keyword and/or by category and results can be ordered by price.

Price history, however, is only narrowly viewed and, when it is, never in the context of a rich attribute set which explores, in detail, which attributes are associated with changes in price. Price histories are not made available in real time, and do not allow intricate comparisons based on stores, merchants, brands, regions, time/date, and other dimensions.

When product and price data is obtained from a large number of webpages, when the webpages contain a large number of records, and when data from the large number of records is processed to discover product and price relationships which can only be teased out via data sets encompassing large swaths of economic activity, batch-based data ingestion and indexing processes which occur across days and cascading analytic dependencies will introduce delays. Such delays prevent the resulting corpus from being searched in close-to-real time. Customers who desire to have new webpages searched and to benefit from discovering product and price relationships will be frustrated by batch process and cascading dependency delays; such customers will have reduced confidence that product and price relationships are up-to-date.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a network and device diagram illustrating exemplary computing devices configured according to embodiments disclosed in this paper.

FIG. 2 is a functional block diagram of an exemplary Indix Server 200 computing device and some data structures and/or components thereof.

FIG. 3 is a functional block diagram of the Indix Datastore 300 illustrated in the computing device of FIG. 2.

FIG. 4 is a flowchart illustrating an embodiment of an Analytics Module.

FIG. 5 is a flowchart illustrating an embodiment of a Core Price Module.

FIG. 6 is a flowchart illustrating an embodiment of an Insights Module.

FIG. 7 is a flowchart illustrating an embodiment of a Volatility Module.

FIG. 8A is a flowchart illustrating a first embodiment of a Substitution Module.

FIG. 8B is a flowchart illustrating a second embodiment of a Substitution Module.

FIG. 8C is a flowchart illustrating a third embodiment of a Substitution Module.

FIG. 9 is a flowchart illustrating an embodiment of a Mix Module.

FIG. 10 is a flowchart illustrating an embodiment of a Prediction Module.

FIG. 11 is a flowchart illustrating an embodiment of a Competition Module.

FIG. 12 is a flowchart illustrating an embodiment of a Promotion Module.

FIG. 13 is a flowchart illustrating an embodiment of a Leadership Module.

FIG. 14 is a flowchart illustrating an embodiment of a Premium Module.

FIG. 15 is a flowchart illustrating an embodiment of a Price Range Module.

FIG. 16 is a flowchart illustrating an embodiment of a Reach Module.

FIG. 17 is a flowchart illustrating an embodiment of a User Contact Module.

FIG. 18 is a flowchart illustrating an embodiment of a Data Ingestion Module.

FIG. 19 is a flowchart illustrating an embodiment of a Query Module.

FIG. 20 is a flowchart illustrating an embodiment of a Get Store Index Module.

FIG. 21 is a flowchart illustrating an embodiment of a Search and Analytics Index Module.

DETAILED DESCRIPTION

The following Detailed Description provides specific details for an understanding of various examples of the technology. One skilled in the art will understand that the technology may be practiced without many of these details. In some instances, structures and functions have not been shown or described in detail or at all to avoid unnecessarily obscuring the description of the examples of the technology. It is intended that the terminology used in the description presented below be interpreted in its broadest reasonable manner, even though it is being used in conjunction with a detailed description of certain examples of the technology. Although certain terms may be emphasized below, any terminology intended to be interpreted in any restricted manner will be overtly and specifically defined as such in this Detailed Description section.

Unless the context clearly requires otherwise, throughout the description and the claims, the words “comprise,” “comprising,” and the like are to be construed in an inclusive sense, as opposed to an exclusive or exhaustive sense; that is to say, in the sense of “including, but not limited to”. As used herein, the term “connected,” “coupled,” or any variant thereof means any connection or coupling, either direct or indirect between two or more elements; the coupling or connection between the elements can be physical, logical, or a combination thereof. Additionally, the words, “herein,” “above,” “below,” and words of similar import, when used in this application, shall refer to this application as a whole and not to particular portions of this application. When the context permits, words using the singular may also include the plural while words using the plural may also include the singular. The word “or,” in reference to a list of two or more items, covers all of the following interpretations of the word: any of the items in the list, all of the items in the list, and any combination of one or more of the items in the list.

Certain elements appear in various of the Figures with the same capitalized element text, but a different element number. When referred to herein with the capitalized element text but with no element number, these references should be understood to be largely equivalent and to refer to any of the elements with the same capitalized element text, though potentially with differences based on the computing device within which the various embodiments of the element appear.

As used herein, a Uniform Resource Identifier (“URI”) is a string of characters used to identify a resource on a computing device and/or a network, such as the Internet. Such identification enables interaction with representations of the resource using specific protocols. “Schemes” specifying a syntax and associated protocols define each URI.

The generic syntax for URI schemes is defined in Request for Comments (“RFC”) memorandum 3986 published by the Internet Engineering Task Force (“IETF”). According to RFC 3986, a URI (including a URL) consists of four parts:

<scheme name>:<hierarchical part>[?<query>][#<fragment>]

A URI begins with a scheme name that refers to a specification for assigning identifiers within that scheme. The scheme name consists of a letter followed by any combination of letters, digits, and the plus (“+”), period (“.”), or hyphen (“-”) characters; and is terminated by a colon (“:”).

The hierarchical portion of the URI is intended to hold identification information that is hierarchical in nature. Often this part is delineated with a double forward slash (“//”), followed by an optional authority part and an optional path.

The optional authority part holds an optional user information part (not shown) terminated with “@” (e.g. username:password@), a hostname (i.e., domain name or IP address, here “example.com”), and an optional port number preceded by a colon “:”.

The path part is a sequence of one or more segments (conceptually similar to directories, though not necessarily representing them) separated by a forward slash (“/”). If a URI includes an authority part, then the path part may be empty.

The optional query portion is delineated with a question mark and contains additional identification information that is not necessarily hierarchical in nature. Together, the path part and the query portion identify a resource within the scope of the URI's scheme and authority.

The query string syntax is not generically defined, but is commonly organized as a sequence of zero or more <key>=<value> pairs separated by a semicolon or ampersand, for example:

key1=value1;key2=value2;key3=value3 (Semicolon), or

key1=value1&key2=value2&key3=value3 (Ampersand)

Much of the above information is taken from RFC 3986, which provides additional information related to the syntax and structure of URIs. RFC 3986 is hereby incorporated by reference, for all purposes.
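The component breakdown described above can be illustrated with a minimal sketch using Python's standard `urllib.parse` module. The function names `split_uri` and `query_pairs` and the sample URI are hypothetical and are used only for illustration; they are not part of the disclosed system.

```python
from urllib.parse import urlsplit, parse_qsl


def split_uri(uri: str) -> dict:
    """Split a URI into the scheme, authority, path, query, and fragment parts."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,      # text before the first ":"
        "user": parts.username,      # optional user information before "@"
        "host": parts.hostname,      # domain name or IP address
        "port": parts.port,          # optional port after ":"
        "path": parts.path,          # "/"-separated segments
        "query": parts.query,        # text after "?" and before "#"
        "fragment": parts.fragment,  # text after "#"
    }


def query_pairs(query: str) -> list:
    """Return (key, value) pairs, accepting ";" or "&" as the separator."""
    # Normalizing ";" to "&" is an illustrative choice so that both of the
    # separator styles shown above are handled the same way.
    return parse_qsl(query.replace(";", "&"), keep_blank_values=True)


if __name__ == "__main__":
    parts = split_uri("https://user:pw@example.com:8080/a/b?key1=value1;key2=value2#top")
    print(parts)
    print(query_pairs(parts["query"]))
```

Because the query string syntax is not generically defined, a real parser would need to follow whatever convention the serving website uses; treating both separators alike is only one possible choice.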

As used herein, “Product” shall be understood to mean “products or services”. References to “Product Attribute” herein shall be understood to mean “product or service attribute”. As used herein, “Products” are associated with iPIDs.

As used herein, an “iPID” or iPID 330 is a unique identifier assigned within the Indix System to a URI for a product, such as URI 305. An iPID 330 may be, for example, a hash of URI 305. When multiple URIs 305 from a common base domain name lead to webpages which, when parsed for Price and Product Attributes, produce the same Parse Result 325 (notwithstanding that the webpages may contain other Content which does not contribute to the Parse Result 325), such iPIDs may be labeled as equivalent in, for example, the Equivalent iPID 334 record and may be treated as the same iPID 330.
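As a minimal sketch of the preceding paragraph, the following derives an identifier by hashing a URI and labels identifiers as equivalent when URIs from a common base domain yield the same parse result. The choice of SHA-256, the two-label base-domain heuristic, the dictionary shape of a parse result, and all function names are assumptions made for illustration; the disclosure states only that an iPID may be a hash of a URI.

```python
import hashlib
import json
from urllib.parse import urlsplit


def ipid_for_uri(uri: str) -> str:
    """Derive an identifier from a URI (SHA-256 is an assumed hash choice)."""
    return hashlib.sha256(uri.encode("utf-8")).hexdigest()


def base_domain(uri: str) -> str:
    """Naive base domain: the last two host labels (ignores suffixes like co.uk)."""
    host = urlsplit(uri).hostname or ""
    return ".".join(host.split(".")[-2:])


def label_equivalent(crawled) -> dict:
    """Map each identifier to a representative identifier.

    `crawled` is an iterable of (uri, parse_result) pairs, where parse_result
    is a JSON-serializable dict of parsed price and product values. URIs that
    share a base domain and an identical parse result share a representative
    (the first identifier seen for that group).
    """
    representative = {}
    equivalent = {}
    for uri, parse_result in crawled:
        key = (base_domain(uri), json.dumps(parse_result, sort_keys=True))
        ipid = ipid_for_uri(uri)
        representative.setdefault(key, ipid)
        equivalent[ipid] = representative[key]
    return equivalent
```

For example, `https://www.example.com/p/1?ref=a` and `https://m.example.com/p/1` would map to one representative identifier if both pages parse to the same values, while an identical parse result on a different base domain would not.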

As used herein, a “Master iPID” or “MPID” or MPID 332 is an iPID 330 assigned to a group of iPIDs 330 derived from URIs 305 which lead to webpages offering the same Product for sale across all Merchants who may be selling the Product. The process of picking an iPID 330 as the MPID 332 for a group of iPID 330 records is described in greater detail in U.S. patent application Ser. Nos. 14/726,707 and 13/951,248. An MPID generally identifies a single Product, generally produced by a common manufacturer, though the Product may be distributed and sold by multiple parties.

iPIDs and MPIDs are associated with Price Attribute 340 records and Product Attribute 345 records.

A Price Attribute 340 record may comprise one or more records comprising, for example, values which encode an iPRID which may be an identifier for a price observed at a particular time, an iPID (discussed above), a Product Name (a “Product Name” value in this record may also be referred to herein as a “Product”), a Standard Price, a Sale, a Price, a Rebate amount, a Price Instructions record (containing special instructions relating to a price, such as that the price only applies to students), a Currency Type, a Date and Time Stamp, a Tax record, a Shipping record (indicating costs relating to shipping to different locations, whether tax is calculated on shipping costs, etc.), a Price Validity Start Date, a Price Validity End Date, a Quantity, a Unit of Measure Type, a Unit of Measure Value, a Merchant Name (with the name of a merchant from whom the Product is available; a “Merchant Name” value in this record may also be referred to herein as a “Merchant”), a Store Name (a Merchant may have multiple stores; a “Store Name” value in this record may also be referred to herein as a “Store”), a User ID, a Data Channel (indicating the source of the Price Attribute 340 record, such as an online crawl, a crowdsource, a licensed supplier of price information, or from a merchant), a Source Details record (for example, indicating a URI, a newspaper advertisement), an Availability Flag, a Promotion Code, a Bundle Details record (indicating products which are part of a bundle), a Condition Type record (indicating new, used, poor, good, and similar), a Social Rank record (indicating a rank of “likes” and similar of the price), a Votes/Likes record (indicating a number of “likes” and similar which a Price or Product has received), a Price Rank record, a Visibility Indicator record (indicating whether the price is visible to the public, whether it is only visible to a Merchant, or the like), a Supply Chain Reference record (indicating whether the price was obtained from a retailer, a wholesaler, or another party in a supply chain), a Sale Location (indicating a geographic location where the product is available at the price), a Manufactured Location record (indicating where the product was produced or manufactured), a Launch Date record (indicating how long the product has been on the market), and an Age of Product record (indicating how long the product was used by the user). When capitalized herein, the foregoing terms (such as Product, Price, Merchant, Store, Source Details, etc.) are meant to refer to values in a Price Attribute 340 record.

A Product Attribute 345 record may comprise, for example, values encoding features of or describing a Product. The entire Product Attribute 345 schema may comprise thousands of columns, though only tens or hundreds of the columns may be applicable to any given Product. Often Products have industry or governmentally standardized identifiers, such as Universal Product Code (“UPC”), Stock Keeping Unit (“SKU”), Manufacturer Part Number (“MPN”) or the like, which are also part of the Product Attribute 345 schema and may be present in a Product Attribute 345 record. An example set of values in a Product Attribute 345 record for a ring is as follows: Title, “Sterling Silver Diamond & Blue Topaz Ring”; Brand, “Blue Nile”; Category (such as, for example, a Category 335 in a category schema), “ring”; Metal Name, “silver”; Stone Shape, “cushion”; Stone Name, “topaz”; Width, “3 mm”; Stone Color, “blue”; Product Type, “rings”; Birthstone, “September”; Setting Type, “prong”; SKU, “CF58489CC”. An example set of Product Attributes 345 for a shoe is as follows: Brand, “Asics”; Category (such as, for example, a Category 335 in a category schema or taxonomy), “Men's Sneakers & Athletic”; Shoe Size, “8”; Product Type, “wrestling shoes”; Color, “black”; Shoe Style, “sneakers”; Sports, “athletic”; Upper Material, “mesh”; SKU, “314194009”. When capitalized herein, the foregoing terms (such as Brand, Category, Metal Name, Product Type, etc.) are meant to refer to values in a Product Attribute 345 record.

As used herein, “Content” comprises text, graphics, images (including still and video images), audio, graphical arrangement, and instructions for graphical arrangement, including HTML and CSS instructions which may, for example, be interpreted by browser applications.

As used herein, “Event” is information generally in news or current events. Events may be found in Content. Listing Pages, Product Pages, and Event Pages are all examples of Webpage Types 350. As used herein, Event Attribute 374 is a record of or relating to an Event, and may record, for example, a sentiment, a time/date, and the like.

As used herein, “PriceDNA” comprises a Product Attribute 345 record, one or more Price Attribute 340 records, the output of the Core Price Module 500 (generally found in the Core Price 380 records), and the output of the Insight Module 600 (generally found in Insight 375 records).

As used herein, a “Brand” is a family or group of Products sold by or under a common trademark, such as the “Nike®” Brand, which sells under this trademark a family of shoes, exercise equipment, and other apparel. Brand is a value within a Product Attribute 345 record.

As used herein, a “Store” is an online or physical sales venue. A Store is a value within a Price Attribute 340 record.

As used herein, a “Merchant” is an operator of one or more Stores. A Merchant is a value in a Price Attribute 340 record.

Generally, Analytics Module 400 performs Data Ingestion Module 1800 and User Contact Module 1700. Data Ingestion Module 1800 imports Price Attribute 340 and Product Attribute 345 records into the Indix Database 300 and Get Store Primary 185 shortly after the records are produced following a crawl of webpages accessed via the URIs 305.

Data Ingestion Module 1800 merges the records, performs Core Price Module 500 to develop core price information, such as changes in price, and exports the records and the result of Core Price Module 500 to Search-Analytics Primary 175. The result of Core Price Module 500 may be searched and accessed by users in close to real-time, such as within one second of when a webpage is crawled. Data Ingestion Module 1800 also performs Insight Module 600. Insight Module 600 comprises a set of sub-modules for deriving additional information from Price Attribute 340 and Product Attribute 345 records and from the output of Core Price Module 500. Generally, Insight Module 600 identifies what Product Attributes 345 and Price Attributes 340 across the datasets are associated with the changes in price. The output of Insight Module 600 is also stored in the Indix Database 300 in Search-Analytics Primary 175 and may be searched and accessed by users in close to real-time.
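Purely as an illustration of what “changes in price” could mean for successive observations of the same iPID, the following sketch compares each new observation against the previous one. It is not the Core Price Module 500 itself, which is described elsewhere; the `PriceObservation` type, its field names, and the `price_changes` function are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal


@dataclass(frozen=True)
class PriceObservation:
    """Hypothetical, reduced stand-in for a Price Attribute record."""
    ipid: str
    price: Decimal
    observed_at: datetime


def price_changes(observations):
    """Return (ipid, observed_at, delta) for each change from the prior price.

    Observations are processed in time order; the first observation of an
    iPID establishes a baseline and produces no change entry.
    """
    last_price = {}
    changes = []
    for obs in sorted(observations, key=lambda o: o.observed_at):
        previous = last_price.get(obs.ipid)
        if previous is not None and obs.price != previous:
            changes.append((obs.ipid, obs.observed_at, obs.price - previous))
        last_price[obs.ipid] = obs.price
    return changes
```

Keeping only the last price per iPID in memory means each new observation is handled in constant time, which fits the goal of making results searchable shortly after a crawl; a production system would presumably persist this state rather than hold it in a dictionary.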

The outputs of Core Price Module 500 and Insight Module 600 depend on many cascading analytic dependencies; for example, what Products are higher or lower priced than a given Product? Answering such a question reliably requires processing information from all available webpages; if the answer cannot place a newly crawled webpage into this context, then a party who submitted the new webpage to be crawled and who has to wait for the answer will be less than confident that the result is up to date with other changes.

User Contact Module 1700 allows users to search and obtain information and to set alerts relative to the information in the Indix Database 300.

FIG. 1 is a network and device diagram illustrating exemplary computing devices configured according to embodiments disclosed in this paper. Illustrated in FIG. 1 are an Indix Server 200 and, within Indix Server 200, API Handler 170, Query Augmentation Handler 171, History Primary 188, Get Store Primary 185, and Search-Analytics Primary 175.

API Handler 170 may receive and route API calls between various of the other computers; API Handler 170 may have a primary and replicas.

Query Augmentation Handler 171 may weight components of queries for trending terms and may augment or weight such terms within a query and/or the results of the query; Query Augmentation Handler 171 may have a primary and replicas. Query Augmentation Handler 171 is discussed further herein.

History Primary 188 may store historic records, such as historic Get Store records from Get Store Primary 185 and, optionally, historic Search and Analytics records from Search-Analytics Primary 175. History Primary 188 is illustrated as having History Replica-1 189, which device is connected to History Shard 1 144 and History Shard N 145. History Primary 188 may connect directly to the illustrated Shards; one or more replicas of History Primary 188 may be used. Actions may be delegated by a primary to a replica. Shard Map 393 may be used to log which records are stored where; multiple instances of Shard Map 393 may be used.

Get Store Primary 185 is illustrated as connecting to, for example, Get Store Replica-1 186 and Get Store Replica-N 187. Get Store Replica-1 186 is illustrated as connecting to, for example, G-S Shard 1 140 and G-S Shard 2 141 while Get Store Replica-N 187 is illustrated as connecting to, for example, G-S Shard 3 142 and G-S Shard N 143. The Get Store (or “G-S”) devices may store, for example, granular, highly detailed, complete current Price Attribute 340 and Product Attribute 345 records, including iPID 330, URI 305, image-URIs, and the like. The Get Store devices are discussed further herein. Actions may be delegated by a primary to a replica. Shard Map 393 may be used to log which records are stored where; multiple instances of Shard Map 393 may be used.

Search-Analytics Primary 175 is illustrated as connecting to, for example, Search-Analytics Replica-1 176 and Search-Analytics Replica-N 177. Search-Analytics Replica-1 176 is illustrated as connecting to, for example, S-A Shard 1 136 and S-A Shard 2 137 while Search-Analytics Replica-N 177 is illustrated as connecting to, for example, S-A Shard 3 138 and S-A Shard N 139. The Search and Analytics (“S-A”) devices may store, for example, the following types of records: Product Title 395 Tokens, the output of Core Price Module 500 (such as Core Price 380 records), the output of Insight Module 600 (Insight 375 records), MPID 332 and iPID 330 record values (which may be used to find corresponding records in Get Store Primary 185 and/or Get Store Replica-1 186 to -N 187), and other records which are not granular or detailed in nature, but are the result of processing granular records (such as Price Attribute 340 and Product Attribute 345 records). Historic records, such as historic Core Price 380 and Insight 375 records may be moved to History Primary 188 and/or History Replica-1 189. The Search and Analytics devices are discussed further herein. Actions may be delegated by a primary to a replica. Shard Map 393 may be used to log which records are stored where; multiple instances of Shard Map 393 may be used.
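One way a structure that logs “which records are stored where” might look is sketched below. The `ShardMap` class, its method names, and the hash-modulo placement rule are assumptions for illustration only; the disclosure does not specify how Shard Map 393 assigns records to shards.

```python
import hashlib


class ShardMap:
    """Hypothetical log of which shard holds each record identifier."""

    def __init__(self, shard_names):
        self.shard_names = list(shard_names)
        self.locations = {}  # record identifier -> shard name

    def assign(self, record_id: str) -> str:
        """Record (or recall) the shard chosen for a record identifier."""
        if record_id not in self.locations:
            # Assumed placement rule: stable hash of the identifier modulo
            # the number of shards.
            digest = hashlib.sha256(record_id.encode("utf-8")).digest()
            index = int.from_bytes(digest[:8], "big") % len(self.shard_names)
            self.locations[record_id] = self.shard_names[index]
        return self.locations[record_id]

    def locate(self, record_id: str):
        """Return the shard holding a record, or None if it was never logged."""
        return self.locations.get(record_id)
```

A replica receiving a query for a given iPID could call `locate` to find the shard to read from. Because the map stores explicit locations rather than recomputing them, records would stay findable even if the list of shards later changed.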

Indix Database 300 is illustrated in FIG. 1 as comprising the S-A and G-S Shards. Indix Database 300 may comprise other records, as well. Indix Database 300 is discussed further in relation to FIG. 3.

Also illustrated in FIG. 1 is a Crawl Agent 438, representing Crawl Agents 1 to N, and a Crawl Agent Database 439. The Crawl Agent 438 and Crawl Agent Database 439 are used to crawl webpages accessed via the URIs 305.

Also illustrated in FIG. 1 is a Client Device 105, such as a mobile or non-mobile computer device. The Client Device 105 is an example of computing devices such as, for example, a mobile phone, a tablet, laptop, personal computer, gaming computer, or media playback computer. The Client Device 105 represents any computing device capable of rendering Content in a browser or an equivalent user-interface. Client Devices are used by “users”. The Client Device 105 may interact with the User Contact Module 1700.

Also illustrated in FIG. 1 is a Web Server 115, which may serve Content in the form of webpages or equivalent output in response to URIs, such as URI 305.

Also illustrated in FIG. 1 is an Ecommerce Platform 160, which may provide ecommerce services, such as website and/or webpage hosting via webpage templates comprising HTML and CSS elements. Customers of Ecommerce Platform 160 may complete the webpage templates with Content and serve the webpages and websites from, for example, Web Server 115.

Interaction among devices illustrated in FIG. 1 may be accomplished, for example, through the use of credentials to authenticate and authorize a machine or user with respect to other machines.

In FIG. 1, the computing machines may be physically separate computing devices or logically separate processes executed by a common computing device. Certain components are illustrated in FIG. 1 as connecting directly to one another (such as, for example, Crawl Agent 438 to Crawl Agent Database 439), though the connections may be through the Network 150. If these components are embodied in separate computers, then additional steps may be added to the disclosed invention to recite communicating between the components.

The Network 150 comprises computers, network connections among the computers, and software modules to enable communication between the computers over the network connections. Examples of the Network 150 comprise an Ethernet network, the Internet, and/or a wireless network, such as a GSM, TDMA, CDMA, EDGE, HSPA, LTE or other network provided by a wireless service provider, or a television broadcast facility. Connection to the Network 150 may be via a Wi-Fi connection. More than one network may be involved in a communication session between the illustrated devices. Connection to the Network 150 may require that the computers execute software modules which enable, for example, the seven layers of the OSI model of computer networking or equivalent in a wireless phone network.

This paper may discuss a first computer as connecting to a second computer (such as a Crawl Agent 438 connecting to the Indix Server 200) or to a corresponding datastore (such as to Indix Database 300); it should be understood that such connections may be to, through, or via the other of the two components (for example, a statement that a computing device connects with or sends data to the Indix Server 200 should be understood as saying that the computing device may connect with or send data to the Indix Database 300). References herein to “database” should be understood as equivalent to “datastore”. Although illustrated as components integrated in one physical unit, the computers and databases may be provided by common (or separate) physical hardware and common (or separate) logic processors and memory components. Though discussed as occurring within one computing device, the software modules and data groups used by the software modules may be stored and/or executed remotely relative to any of the computers through, for example, application virtualization.

FIG. 2 is a functional block diagram of an exemplary Indix Server 200 computing device and some data structures and/or components thereof. Indix Server 200 in FIG. 2 comprises at least one Processing Unit 210, Indix Server Memory 250, a Display 240 and Input 245, all interconnected along with the Network Interface 230 via a Bus 220. Processing Unit 210 may comprise one or more general-purpose Central Processing Units (“CPU”) 212 as well as one or more special-purpose Graphics Processing Units (“GPU”) 214. The components of Processing Unit 210 may be utilized by Operating System 255 for different functions required by the modules executed by Indix Server 200. Network Interface 230 may be utilized to form connections with Network 150 or to form device-to-device connections with other computers. Indix Server Memory 250 generally comprises a Random Access Memory, RAM 251, a Read Only Memory, ROM 252, and a permanent mass storage device, such as a Disk Drive or SDRAM (synchronous dynamic random-access memory), DD 254, a Solid State Drive, SSD 253, and hybrids thereof. As discussed herein, data which requires very fast access time may be held in RAM 251, data requiring fast access time may be held in SSD 253, and other data may be held in DD 254, taking advantage of the different read/write speeds and the different costs and memory density of these different types of memory.

Indix Server Memory 250 stores program code for software modules, such as, for example, Analytics Module 400, Core Price Module 500, Insight Module 600, Volatility Module 700, Substitution Module 800, Mix Module 900, Prediction Module 1000, Competition Module 1100, Promotion Module 1200, Leadership Module 1300, Premium Module 1400, Price Range Module 1500, Reach Module 1600, User Contact Module 1700, Data Ingestion Module 1800, Query Module 1900, and Query Augmentation Module 260 as well as, for example, browser, email client and server modules, client applications, and database applications (discussed further below). Additional data groups for routines and modules, such as for a webserver and web browser, may also be present on and executed by Indix Server 200 and the other computers illustrated in FIG. 1. Webserver and browser modules may provide an interface for interaction among the computing devices, for example, through webserver and web browser modules which may serve and respond to data and information in the form of webpages and html documents or files. The browsers and webservers are meant to illustrate machine- and user-interface and user-interface enabling modules generally, and may be replaced by equivalent modules for serving and rendering information to and in interfaces in a computing device (whether in a web browser or in, for example, a mobile device application).

In addition, Indix Server Memory 250 also stores an Operating System 255. These software components may be loaded from a non-transient Computer Readable Storage Medium 295 into Indix Server Memory 250 of the computing device using a drive mechanism (not shown) associated with a non-transient Computer Readable Storage Medium 295, such as a floppy disc, tape, DVD/CD-ROM drive, memory card, or other like storage medium. In some embodiments, software components may also or instead be loaded via a mechanism other than a drive mechanism and Computer Readable Storage Medium 295 (e.g., via Network Interface 230).

Computing device 200 may also comprise hardware supporting input modalities, Input 245, such as, for example, a touchscreen, a camera, a keyboard, a mouse, a trackball, a stylus, motion detectors, and a microphone. Input 245 may also serve as a Display 240, as in the case of a touchscreen display which also serves as Input 245, and which may respond to input in the form of contact by a finger or stylus with the surface of Input 245.

Computing device 200 may also comprise or communicate via Bus 220 with Indix Datastore 300, illustrated further in FIG. 3. In various embodiments, Bus 220 may comprise a storage area network (“SAN”), a high speed serial bus, and/or other suitable communication technology. In some embodiments, Indix Server 200 may communicate with Indix Datastore 300 via Network Interface 230. Indix Server 200 may, in some embodiments, include many more components than those shown in this Figure. However, it is not necessary that all of these generally conventional components be shown in order to disclose an illustrative embodiment.

FIG. 3 is a functional block diagram of Indix Datastore 300 illustrated in the computing device of FIG. 2. The components of Indix Datastore 300 are data groups used by modules and are discussed further herein in the discussion of other of the Figures. The data groups used by modules illustrated in FIG. 3 may be represented by a cell in a column or a value separated from other values in a defined structure in a digital document or file. Though referred to herein as individual records or entries, the records may comprise more than one database entry. The database entries may be, represent, or encode numbers, numerical operators, binary values, logical values, text, string operators, joins, conditional logic, tests, and similar.

FIG. 4 is a flowchart illustrating an embodiment of Analysis Module 400. Analysis Module 400 may be performed by, for example, Indix Server 200.

At block 405, Analysis Module 400 obtains a new set of Parse Result 325 records, such as Parse Result 325 records produced by, for example, Parser Routine 700 described in U.S. patent application Ser. Nos. 14/726,707, 13/951,244, and U.S. provisional patent application No. 61/675,492 (incorporated herein and in the present document's cross-reference to related applications). Such records may comprise Price Attribute 340 and Product Attribute 345 records, which records are obtained from a crawl of a URI 305, which URI 305 has a derived iPID 330 (which may be a hash of URI 305), and with an assigned MPID 332 and Category 335. This may occur as frequently as URIs 305 are crawled and webpages therefrom parsed into Parse Results 325.

Opening loop block 410 to closing loop block 415 may iterate for sets of Parse Result 325 records of block 405. Sets of Parse Result 325 records may be sets comprising, for example, Parse Result 325 records obtained from crawls of a common parent domain name or a group of domain names which are known to be used by the same Store or Merchant.

At block 1800, Analysis Module 400 may execute Data Ingestion Module 1800. As described further herein, Core Price Module 500 and Insight Module 600 may be executed by or as part of Data Ingestion Module 1800, such that the results therefrom may be stored, indexed, and may be made available to be searched and found substantially as Parse Result 325 records are obtained.

At block 1700, Analysis Module 400 performs User Contact Module 1700. Utilizing User Contact Module 1700, users may query the Indix Datastore 300 and set alerts. User Contact Module 1700 may execute Query Module 1900, which allows search and recovery of the results of Core Price Module 500 and Insight Module 600 substantially as Parse Result 325 records are obtained.

At concluding block 499, Analysis Module 400 may conclude or return to a process which spawned it.

FIG. 5 is a flowchart illustrating an embodiment of Core Price Module 500. Core Price Module 500 may be called or spawned by Data Ingestion Module 1800. Opening loop block 505 to closing loop block 520 may iterate for each iPID 330 with a new Price Attribute 340 record in the record or group of records processed by Data Ingestion Module 1800.

At block 510, current Price Attribute 340 records and previous PriceDNA associated with the iPID 330 may be obtained, including a new Price Attribute 340 record and, if not already present in a stored formula, historic records (and/or summary values derived therefrom), such as from History Primary 188.

At block 515, the high, low, average, mean, magnitude, and number of price values over several time periods for the iPID 330 may be calculated. A default time period may be 45 or 30 days, though these values may be calculated for several time periods. This may be performed, for example, by saving the Price Attribute 340 record values and re-executing a formula, which formula may call such saved record values. The result may be saved, for example, to the Core Price 380 records.
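The calculation of block 515 can be illustrated with a minimal sketch. The function name `core_price_stats`, the `(timestamp, price)` tuple layout, and the count of price changes are assumptions made for illustration only; the 30 day default follows the text above.

```python
from datetime import datetime, timedelta
from statistics import mean

def core_price_stats(observations, now, window_days=30):
    """Summarize (timestamp, price) observations for one iPID inside a window.

    Hypothetical helper; the record layout is an assumption.
    """
    cutoff = now - timedelta(days=window_days)
    # Keep only in-window prices, in chronological order.
    prices = [price for ts, price in sorted(observations) if cutoff <= ts <= now]
    if not prices:
        return None
    # A change is counted whenever consecutive observations differ.
    changes = sum(1 for prev, cur in zip(prices, prices[1:]) if prev != cur)
    return {
        "high": max(prices),
        "low": min(prices),
        "average": mean(prices),
        "count": len(prices),
        "changes": changes,
    }
```

Calling the function again with a different `window_days` value yields the same summary for another time period, which mirrors re-executing a stored formula over saved record values.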

At closing loop block 520, Core Price Module 500 may return to opening loop block 505 to iterate over the next iPID.

Opening loop block 525 to closing loop block 535 may iterate for each MPID 332 associated with an iPID 330 in the record or group of records being processed by Data Ingestion Module 1800. At block 530, the high, low, average, mean, magnitude, and number of price values over several time periods may be calculated for the MPID 332 utilizing the new value associated with the iPID 330 from block 515. The iPID 330 may be a hash of a URI 305 and the result of block 515 is thus limited to a particular sales channel (typically a Store) for a particular Product (taking into account that duplicate iPIDs 330 from a base domain name may be treated as equivalent); an MPID 332 is assigned to all iPIDs 330 which represent the same Product, so the MPID version of this calculation in block 530 returns values relating to the Product across Stores, Merchants, Locations, etc. The calculation of block 530 may return values which are or may be sorted by, for example, Store, Merchant, Location (such as Region), and by time periods such as a Season. The output may be saved, for example, to the Core Price 380 records, and indexed. At closing loop block 535, Core Price Module 500 may return to opening loop block 525 to iterate over the next MPID 332 associated with an iPID 330 in the record or group of records being processed by Data Ingestion Module 1800.
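The difference between the per-iPID result of block 515 and the per-MPID result of block 530 can be sketched as a roll-up of current iPID prices under their shared MPID. The function name `mpid_stats` and the two dictionary inputs are hypothetical and stand in for whatever record structure an implementation actually uses.

```python
from collections import defaultdict

def mpid_stats(ipid_prices, ipid_to_mpid):
    """Aggregate current per-iPID prices into per-MPID summary values.

    ipid_prices:  {ipid: price}  (assumed layout)
    ipid_to_mpid: {ipid: mpid}   (assumed layout)
    """
    grouped = defaultdict(list)
    for ipid, price in ipid_prices.items():
        grouped[ipid_to_mpid[ipid]].append(price)
    return {
        mpid: {
            "high": max(prices),
            "low": min(prices),
            "average": sum(prices) / len(prices),
            "count": len(prices),
        }
        for mpid, prices in grouped.items()
    }
```

Because every iPID that represents the same Product shares an MPID, the resulting values describe the Product across Stores rather than within a single sales channel.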

At block 540, all calculations and other modules which utilize the values for the iPID 330 from block 515 and the associated MPID 332 from block 530 may insert the new or updated values and may perform a recalculation. For example, the high, low, average, mean, magnitude, and number of price changes over time periods by Category 335, such as a Category 335 associated with the iPID 330, may be calculated. The output may be saved, for example, to the Core Price 380 records, and indexed.

Calculations or other modules which utilize the values calculated in FIG. 5 may refer to data addresses. The Core Price Module 500 may update the values stored at these data addresses, which causes the calculations or other modules to update their output when such calculations or other modules are (re)executed, such as on a schedule or on the occurrence of an event.

At block 599, the Core Price Module 500 may return, for example, to Data Ingestion Module 1800, which may return to Analysis Module 400.

FIG. 6 is a flowchart illustrating an embodiment of Insight Module 600.

Insight Module 600 may perform one or more of a set of sub-modules. At block 700, Volatility Module 700 may be performed to determine the volatility of prices relative to the many dimensions available in the PriceDNA. At block 800, Substitution Module 800 may determine substitutes for an iPID 330, MPID 332, or Category 335. At block 900, Mix Module 900 determines “how many” relative to the many dimensions available in the PriceDNA. At block 1000, Prediction Module 1000 makes price predictions relative to the many dimensions available in the PriceDNA. At block 1100, Competition Module 1100 determines competitors relative to a Product, Store, or Brand. At block 1200, Promotion Module 1200 determines promotions relative to Products, Stores, Brand, Seasons, and other dimensions available in the PriceDNA. At block 1300, Leadership Module 1300 determines which Products lead or follow others in terms of price changes. At block 1400, Premium Module 1400 determines which Products in Category 335 charge higher (premium) prices. At block 1500, Price Range Module 1500 determines the number of price ranges and maximum and minimum for iPIDs, MPIDs, and categories. At block 1600, Reach Module 1600 determines the reach of an iPID or MPID in terms of the number of people who visit a sales venue.

At block 699, Insight Module 600 may conclude and/or return to a process which spawned it.

FIG. 7 is a flowchart illustrating an embodiment of a Volatility Module 700. At block 705, the Prices associated with an iPID 330 over a time period, such as 30 days, may be obtained, such as from Core Price 380 records. At block 710, the number of price changes within the time period may be determined (if this was not already a value in the Core Price 380 records). At block 715, the number of price changes within the time period (“VBF”) may be determined relative to, for example, the iPID 330, relative to an MPID 332 associated with the iPID 330, relative to a Brand, relative to a Region, relative to a Price Band by MPID 332, relative to a Category 335, and relative to all iPIDs 330 associated with a Merchant. The values may be saved and indexed to accelerate access to and/or enable searching for the values and/or the values may be calculated on an as-needed basis. The values may be saved to Insight 375 records.

At block 720, the benchmark number of Price changes in the period of time may be determined. The benchmark may be, for example, the VBF relative to additional criteria, such as, for example, the VBF for a Product (or MPID), plus 1, divided by the maximum VBF of other Products in the same Category as the Product, multiplied by 100 over 101. The benchmark VBF for a Category may be determined by the VBF for the Category, plus 1, divided by the maximum VBF of the Category, multiplied by 100 over 101. The benchmark VBF for a Merchant may be the VBF of the Merchant, plus 1, divided by the maximum VBF of the Merchant, multiplied by 100 over 101. The benchmark VBF for a Brand may be the VBF of the Brand, plus 1, divided by the maximum VBF of the Brand, multiplied by 100 over 101. The values may be saved to Insight 375 records and indexed.
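The benchmark arithmetic of block 720 can be written out as a short sketch. The function name `benchmark_vbf` is hypothetical, and rejecting a non-positive maximum is an assumption, since the text does not say how a zero maximum is handled.

```python
def benchmark_vbf(vbf: int, max_vbf: int) -> float:
    """Return (VBF + 1) / max VBF * 100 / 101, as described for block 720."""
    if max_vbf <= 0:
        # Assumption: a comparison group with no price changes has no benchmark.
        raise ValueError("max_vbf must be positive")
    return (vbf + 1) / max_vbf * 100 / 101
```

For example, a Product with 4 price changes in a Category whose maximum is 10 gives (4 + 1) / 10 × 100 / 101, or roughly 0.495.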

FIGS. 8A-8C are flowcharts illustrating embodiments of Substitution Module 800, labeled as 800-A, 800-B, and 800-C. In a first example of an embodiment of Substitution Module 800 illustrated in FIG. 8A as Substitution Module 800-A, substitute Products within Category 335 are identified. At block 801, which, like other steps may be optional, a Product may be identified by, for example, a user or a module, and the MPID 332 corresponding thereto may be obtained. At block 805, a Category 335 may be obtained, whether corresponding to the Product and MPID of step 801 or via a user query or other input, and all MPIDs 332 within the Category 335 may be obtained. At block 810 a Price Band may be obtained or calculated relative to the Category 335 (such as from or according to the Price Range Module 1500); the Price Band may be selected by a user. Blocks 815 through 830 may iterate for each iPID 330 within the Category of block 805.

At block 820, the iPIDs 330 in the Category of block 805 and with a Price value within the Price Band of block 810 are identified, such as from the Core Price 380 records. At block 825, the result of block 820 may be subdivided, grouped, or filtered by Region, Time, Used/New, and according to other dimensions available in the PriceDNA. At block 830 the Substitution Module 800 may iterate over the remaining iPIDs 330 in the Category 335. At block 835, the results may be saved as Substitutes, such as to Insight 375 records. At block 839, the process may return to a process which spawned it.
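A minimal sketch of the selection in block 820 follows, assuming each record is a dictionary with hypothetical `ipid`, `category`, and `price` keys and that a Price Band is an inclusive `(low, high)` pair.

```python
def substitutes_in_band(records, category, band):
    """Return the iPIDs in `category` whose price lies inside `band` (inclusive)."""
    low, high = band
    return [
        record["ipid"]
        for record in records
        if record["category"] == category and low <= record["price"] <= high
    ]
```

The further subdivision of block 825 (by Region, Time, or Used/New) would amount to additional equality filters of the same kind on other record fields.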

In a second example of an embodiment of a Substitution Module 800, illustrated in FIG. 8B as Substitution Module 800-B, substitute Products within a Category 335 with a percentage overlap in Attributes 340/345 and within a Price Band are identified. At block 840, a Category 335 may be obtained, whether corresponding to a Product or via a user query or other input, and all MPIDs 332 within the Category 335 may be obtained. At block 845, the Product Attributes 345 of all iPIDs 330 within the MPIDs 332 may be obtained. At block 850, the Product Attributes 345 may be clustered to identify the iPIDs 330 with at least a 50% Product Attribute 345 match or overlap. At block 855 a Price Band may be obtained or calculated relative to the Category 335 (such as from or according to the Price Range Module 1500); the Price Band may be selected by a user.

Blocks 860 through 870 may iterate for each iPID 330 within the MPIDs 332 and Attribute 345 match of block 850. At block 865, the iPIDs 330 with a Price value within the Price Band of block 855 and with the Product Attribute 345 match or overlap of block 850 are identified. The result of block 865 may be subdivided or grouped further by sub-Price Ranges. At block 870 the Substitution Module 800 may iterate over the remaining iPIDs 330 in the MPIDs 332 within the Category 335. At block 871, the results may be saved as Substitutes in Insight 375 records. At block 874, the process may return.

In a third example of an embodiment of a Substitution Module 800, illustrated in FIG. 8C as Substitution Module 800-C, substitute Products within a Category 335 with a percentage overlap in Attributes 340/345 and in the top or bottom of a Price Range are identified. At block 875, a Category 335 may be obtained, whether corresponding to a Product or via a user query or other input, and all MPIDs 332 within the Category 335 may be obtained. At block 880, the Product Attributes 345 of all iPIDs 330 within the MPIDs 332 may be obtained. At block 885, the Product Attributes 345 may be clustered to identify the iPIDs 330 with at least a 50% Product Attribute 345 match or overlap.

Blocks 890 through 897 may iterate for each iPID 330 within the MPIDs 332 and Attribute 345 match of block 885. At block 895, the iPIDs 330 with the Product Attribute 345 match or overlap of block 885 and in the top or bottom of a Price Range or Price Band relative to the starting iPID 330 are identified. At block 896 the top or bottom five (or another subset) of block 895 may be selected. At block 897 this embodiment of the Substitution Module 800 may iterate over the remaining iPIDs 330 in the MPIDs 332 within the Category 335. At block 898, the results may be saved as Substitutes in Insight 375 records. At block 899, the process may return.
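The attribute-overlap test used in the second and third examples can be sketched as follows. This sketch substitutes a direct pairwise comparison for the clustering step described above, and it defines overlap as the fraction of the starting Product's attribute name/value pairs that a candidate shares; both choices, along with the names `attribute_overlap` and `attribute_substitutes` and the record keys, are assumptions.

```python
def attribute_overlap(target_attrs: dict, other_attrs: dict) -> float:
    """Fraction of the target's attribute pairs that the other record shares."""
    if not target_attrs:
        return 0.0
    shared = sum(1 for key, value in target_attrs.items() if other_attrs.get(key) == value)
    return shared / len(target_attrs)

def attribute_substitutes(target, candidates, threshold=0.5, top_n=5, cheapest=True):
    """Select candidates at or above the overlap threshold, then take the
    cheapest (or most expensive) `top_n` of them."""
    matches = [
        c for c in candidates
        if attribute_overlap(target["attrs"], c["attrs"]) >= threshold
    ]
    matches.sort(key=lambda c: c["price"], reverse=not cheapest)
    return [c["ipid"] for c in matches[:top_n]]
```

Setting `cheapest=False` returns the top of the price ordering instead of the bottom, which corresponds to the "top or bottom five" selection of block 896.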

FIG. 9 is a flowchart illustrating an embodiment of Mix Module 900. Mix Module 900 determines “how many” relative to the many dimensions available in the PriceDNA. At block 905, Mix Module 900 obtains a first segmentation criterion, such as, for example, a Product Name, Brand, or Category. At block 910, a first sub-segmentation criterion may be obtained, such as, for example, a Store, Location, or Price Band. At block 915, a second sub-segmentation criterion may be obtained, such as, for example, a Store, Location, or Price Band. At block 920, the number of Products, such as by MPID 332, which meet the criteria of blocks 905, 910, and 915 may be counted. At block 925, the result of block 920 may be subdivided or grouped by Location, Time, Season, Price Band, Used/New or other dimensions available in the PriceDNA. At block 930, the results of blocks 920 and/or 925 may be saved as Mix values in Insight 375 records. At block 999, the process may return.
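The counting step of block 920 can be sketched as a filter followed by a distinct count on MPID. The function name `mix_count` and the flat dictionary records are hypothetical; criteria are passed as keyword arguments purely for brevity.

```python
def mix_count(products, **criteria):
    """Count distinct MPIDs among records that match every given criterion."""
    matching_mpids = {
        product["mpid"]
        for product in products
        if all(product.get(field) == wanted for field, wanted in criteria.items())
    }
    return len(matching_mpids)
```

For instance, `mix_count(products, brand="Acme", store="S1")` would answer how many distinct Products of one Brand are offered at one Store; a Product listed by several Stores is counted once because the count is taken over MPIDs.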

FIG. 10 is a flowchart illustrating an embodiment of Prediction Module 1000. Prediction Module 1000 makes price predictions relative to the many dimensions available in the PriceDNA. At block 1005, Prediction Module 1000 obtains a Product and obtains or identifies an MPID 332 and/or iPIDs 330 associated therewith. At block 1010, the last Price of the Product by MPID 332 and/or iPID 330 may be obtained, such as from Core Price 380 records. At block 1015, first and second linear regression parameters may be calculated or obtained.

At block 1020, the second parameter, multiplied by the last Price of the Product from block 1010, may be added to the first parameter. At block 1025 an error term may be added to the result of block 1020. At block 1030 a confidence interval may be calculated. At block 1035 the result may be saved as Predictions in Insight 375 records. At block 1035 the Prediction Module 1000 may then return.

In FIG. 10, the predicted Price for a product may be determined according to the following equation: $p_t = \alpha + \beta p_{t-1} + \epsilon$, where $p_t$ is the price at time $t$, $\alpha$ and $\beta$ are the parameters of the linear regression, and $\epsilon$ is the error term, which is assumed to be Normally distributed. Confidence, C, is a measure that represents the chance of making a 0.01% error in predicting the price of the product,

$C = {{{{normsdist}(Z)}\mspace{14mu} {and}\mspace{14mu} Z} = {\frac{{.01}\%*{Price}}{( {{Std}.{Error}} )}.}}$

In this formula, the parameters of the model are estimated using the ordinary least squares method as follows:

$\hat{\beta} = \frac{\sum p_{t-1} p_t - \frac{1}{n} \sum p_{t-1} \sum p_t}{\sum p_{t-1}^2 - \frac{1}{n} \left( \sum p_{t-1} \right)^2} \quad \text{and} \quad \hat{\alpha} = \bar{p}_t - \hat{\beta}\, \bar{p}_{t-1}$
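As an illustration, the regression and confidence measure can be sketched with the Python standard library. The sketch uses the standard ordinary least squares slope, in which the cross term $\frac{1}{n} \sum p_{t-1} \sum p_t$ is subtracted in the numerator. The function names are hypothetical, `normsdist` is taken to be the standard normal cumulative distribution function, and the standard error is assumed to be supplied by the caller.

```python
from statistics import NormalDist

def fit_ar1(prices):
    """Estimate alpha and beta in p_t = alpha + beta * p_(t-1) by ordinary least squares."""
    x, y = prices[:-1], prices[1:]  # lagged prices and current prices
    n = len(x)
    if n < 3:
        raise ValueError("need at least four prices")
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    denominator = sum_xx - sum_x * sum_x / n
    if denominator == 0:
        raise ValueError("lagged prices are constant; slope is undefined")
    beta = (sum_xy - sum_x * sum_y / n) / denominator
    alpha = sum_y / n - beta * sum_x / n
    return alpha, beta

def predict_next(prices):
    """Point prediction for the next price (the error term has mean zero)."""
    alpha, beta = fit_ar1(prices)
    return alpha + beta * prices[-1]

def confidence(price, std_error):
    """C = normsdist(Z) with Z = 0.01% * Price / Std. Error."""
    z = 0.0001 * price / std_error
    return NormalDist().cdf(z)
```

With the price series 1, 2, 4, 8, 16, every price is exactly twice its predecessor, so the fit gives a slope of 2 and an intercept of 0, and the next predicted price is 32.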

FIG. 11 is a flowchart illustrating an embodiment of Competition Module 1100. Competition Module 1100 determines competitors relative to Stores, Brands, or Merchants. At block 1105, a first and second (or more) Store, Brand, or Merchant may be obtained, along with an optional Category 335. These may be obtained from a user or another module. At block 1110, all Products sold by or under each of the entities of block 1105 may be obtained, such as from the PriceDNA. The Products may optionally be filtered by the Category of block 1105.

At block 1115, a determination may be made regarding whether or not the entities of block 1105 have 70% or more overlapping Products, per the Products of block 1110. The affirmative output of this block may be saved as Competitors in Insight 375 records.
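The overlap test of block 1115 can be sketched with set operations. The text does not say what the 70% is measured against, so this sketch assumes the smaller of the two catalogs as the denominator; the function names are hypothetical and each catalog is assumed to be a set of MPIDs.

```python
def product_overlap(catalog_a: set, catalog_b: set) -> float:
    """Share of the smaller catalog that also appears in the other catalog."""
    if not catalog_a or not catalog_b:
        return 0.0
    return len(catalog_a & catalog_b) / min(len(catalog_a), len(catalog_b))

def are_competitors(catalog_a: set, catalog_b: set, threshold: float = 0.7) -> bool:
    """True when the overlap meets the threshold (70% by default)."""
    return product_overlap(catalog_a, catalog_b) >= threshold
```

Other denominators, such as the union of both catalogs, would make the test stricter; the choice depends on how asymmetric catalogs should be treated.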

At block 1120, the Competitors may be filtered by, for example, one or more of Store, Substitute, Substitute by Price Band, Brand, Location (including Region), Time (including Season), and whether the Products are sold as used or new. Which criteria are used in the filter may be determined by input from a user. The output of block 1120 may be saved in Insight 375 records.

At block 1125, the average price of Products in the Category 335 of block 1105 may be obtained relative to, for example, Category 335, Substitute, Substitute by Price Band, Brand, Location, Time, used/new status, and other criteria. At block 1130, the output of block 1125 may be ranked and saved as Price Competitiveness in Insight 375 records.

At block 1135, a Store and Location for a target Product may be obtained, such as from a user. At block 1145, the Competitors from block 1115 may be obtained or determined and the Competitors filtered to select only Competitors with sales in the Location of block 1135. At block 1145, Stores in the Location which are the same as the Store of block 1135 may be removed from the set of Competitors, leaving the remainder (those not removed).

At block 1150, the output of block 1145 may be placed in a Voronoi Diagram or similar data structure, with the location in the Voronoi Diagram being based on the physical location of the Stores of the Competitors. Generally, a Voronoi Diagram determines the distance between objects in a geometric manner, rather than a power-law manner. At block 1155, the distance between the target Store and each Competitor may be ranked. At block 1160, the output of block 1155 may be saved as Reach Competitiveness in Insight 375 records.
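The distance ranking of block 1155 can be approximated without building a Voronoi structure. The following sketch is a plain pairwise ranking by great-circle (haversine) distance, not a Voronoi Diagram; the function names, the `(latitude, longitude)` tuples in degrees, and the mean Earth radius of 6371 km are assumptions.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def haversine_km(point_a, point_b):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*point_a, *point_b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))

def rank_by_distance(target_location, competitor_locations):
    """Return competitor names ordered from nearest to farthest."""
    return sorted(
        competitor_locations,
        key=lambda name: haversine_km(target_location, competitor_locations[name]),
    )
```

A spatial index or Voronoi structure becomes worthwhile when many target Stores are queried against a large, mostly static set of Competitor locations.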

FIG. 12 is a flowchart illustrating an embodiment of Promotion Module 1200. Promotion Module 1200 determines promotions relative to Products, Stores, Brand, Seasons, and other dimensions available in the PriceDNA. At block 1205, a Product may be obtained, such as from user input, and the MPID 332 and/or an iPID 330 corresponding to the Product may be identified in the Attributes 340/345 (via, for example, the Sequential File 365). The Product may be a single Product or a Bundle comprising multiple Products. At block 1210, a “Promotion” value may be identified in the Attributes 340/345 associated with the MPID 332 and/or iPID 330; the “Promotion” value may be a Sale Price and/or a Promotion Code in the Price Attribute 340 records associated with the MPID 332 and/or iPID 330. Alternatively, at block 1210 the Price history for the MPID 332 and/or iPID 330 may be graphed.

At block 1215, the number, length, date/time, and magnitude of the Promotions may be determined and saved as Promotions in Insight 375 records. Alternatively, the number, length, date/time, and magnitude of the low-points in the graph of block 1210 may be determined and saved as Promotions in Insight 375 records. At block 1220, the output of block 1215 may be filtered by criteria such as, for example, date/time, Price Band, Location (including Region), Season, and Holidays. The criteria may be received from, for example, a user and/or a default set of criteria may be applied, with the result of each being saved in Insight 375 records.
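The alternative of block 1215, which infers Promotions from low-points in the Price history, can be sketched as run detection below a baseline. Using the median price as the baseline and a 5% minimum drop are assumptions, as are the function name and the daily, chronologically ordered price list.

```python
from statistics import median

def find_promotions(prices, min_drop=0.05):
    """Return (start_index, length, magnitude) for each run of discounted prices.

    A price counts as discounted when it is more than `min_drop` below the
    median of the series; magnitude is the deepest drop within the run.
    """
    baseline = median(prices)
    promotions, start = [], None
    # The trailing baseline value is a sentinel that closes a run at the end.
    for index, price in enumerate(prices + [baseline]):
        discounted = price < baseline * (1 - min_drop)
        if discounted and start is None:
            start = index
        elif not discounted and start is not None:
            run = prices[start:index]
            promotions.append((start, len(run), baseline - min(run)))
            start = None
    return promotions
```

The number of Promotions is then the length of the returned list, and the start index can be mapped back to a date for the date/time filtering of block 1220.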

At block 1225 a time period and a Merchant may be obtained, such as from a user; the Merchant may be associated with the Product of block 1205. At block 1225, the number of Products sold by the Merchant in Promotion during the time period may be determined.

At block 1230, the result of block 1215 may be benchmarked relative to average Promotion times, durations, and magnitude for other Products (including other Bundles of the Product), the timing of Promotions for other Products, relative to the magnitude of Promotions for other Products, relative to the Products associated with a Brand, relative to all Products sold at a Store, relative to Products in a Price Band, and relative to Competitors and Substitutes. The result may be saved in Insight 375 records.

FIG. 13 is a flowchart illustrating an embodiment of Leadership Module 1300. Leadership Module 1300 determines which Products lead or follow others in terms of price changes. At block 1305, a Product may be obtained, for example, from a user or another module, and the associated MPID 332 determined. At block 1310 Substitutes for the Product may be obtained (such as from or by Substitution Module 800). At block 1315, the change in Price, or Price delta, for the Product and the Substitutes may be determined over periods of time. The Price delta may be determined in an absolute sense (whether the change was positive or negative) and/or with a determination of the magnitude of the Price delta.

At block 1320, the Price deltas determined at block 1315 may be matched, to determine if any of the Price deltas with the same absolute value (positive or negative) occurred within a time window of one another (deltas beyond the time window may not be considered to be correlated), with the result being saved as a Leader/Follower indication in Insight 375 records.
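The matching of block 1320 can be sketched as a pairwise comparison of price changes. The representation of each change as a `(day_number, delta)` pair, the 7 day default window, the requirement that the follower's change come strictly after the leader's, and the function names are all assumptions.

```python
from math import isclose

def match_price_deltas(leader_changes, follower_changes, window_days=7):
    """Pair equal price deltas where the follower moved after the leader,
    within `window_days`. Returns (leader_day, follower_day, lag) tuples."""
    matches = []
    for lead_day, lead_delta in leader_changes:
        for follow_day, follow_delta in follower_changes:
            lag = follow_day - lead_day
            if 0 < lag <= window_days and isclose(lead_delta, follow_delta):
                matches.append((lead_day, follow_day, lag))
    return matches

def average_lag(matches):
    """Mean lead/follow time in days, or None when nothing matched."""
    if not matches:
        return None
    return sum(lag for _, _, lag in matches) / len(matches)
```

Swapping the two arguments tests the opposite direction, so running the function both ways indicates which Product tends to lead.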

At block 1325, the matching Price deltas of block 1320 may be graphed according to time. At block 1330, the result of block 1325 may be filtered by criteria such as Region, Time, Date/Time, Season, Price Band, and Store.

At block 1335, the number of Leaders and Followers may be determined relative to a time period. At block 1340, the average lead/follow time may be determined. At block 1345, leaders/followers with respect to exact Product matches (for different Stores selling the same Product, determined at block 1330) may be identified. At block 1350, the results may be benchmarked relative to the number of leaders/followers and other criteria. The result of various of the blocks in FIG. 13 may be saved in Insight 375 records. At block 1399, the Leadership Module 1300 may return.

FIG. 14 is a flowchart illustrating an embodiment of Premium Module 1400. Premium Module 1400 determines which Products (generally, by MPID) in a Category 335 charge higher Prices (premium). At block 1405, a Product may be received, such as from input by a user or another module. At block 1410, the Substitutes for the Product may be determined or obtained from another module, such as Substitution Module 800 and/or Insight 375 records. At block 1415, the Prices of the Product and of the Substitutes may be obtained, such as from the Core Price 380 records. At block 1420, the obtained Prices of block 1415 may be graphed or mapped and the top of the Price distribution identified. The top of the Price distribution may be the top five or ten percent, or the top five Products or Substitutes, which may be identified and saved as the “Premium” Products in Insight 375 records.
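The "top five" variant of block 1420 reduces to sorting by Price. The function name `premium_products` and the `{mpid: price}` mapping are hypothetical.

```python
def premium_products(prices_by_mpid, top_n=5):
    """Return the `top_n` MPIDs with the highest prices, most expensive first."""
    ranked = sorted(prices_by_mpid, key=prices_by_mpid.get, reverse=True)
    return ranked[:top_n]
```

A percentage cut (for example the top five or ten percent) could be obtained by deriving `top_n` from the number of Products before calling the function.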

At block 1425, the Product Attributes 345 of the Products and Substitutes of block 1410 may be obtained and clustered by similarity. At block 1430, the Product Attributes 345 unique to or dominant in the Premium Products, determined by the clusters of block 1425, may be identified and saved in Insight 375 records.

At block 1435, user votes regarding Product Attributes 345 of Premium Products may be received. At block 1440, the user votes may be tallied and, at block 1445, the “winning” Product Attributes 345 (with the most votes) may be set as the Product Attributes 345 associated with the Premium Products in Insight 375 records.

FIG. 15 is a flowchart illustrating an embodiment of Price Range Module 1500. Price Range Module 1500 determines the number of price ranges and maximum and minimum for iPIDs, MPIDs, and categories. At block 1505, a Product may be obtained, such as from a user or another module. At block 1510, the Prices for the Product may be obtained, such as from the PriceDNA for the Product. At block 1515, the Prices of block 1510 may be clustered by similarity and with a minimum cluster size, with the range in Price across each cluster being saved as Price Ranges for the Product in Insight 375 records.
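The clustering of block 1515 can be sketched with a simple one-dimensional gap rule. The text does not name a clustering method, so splitting sorted prices wherever neighbours differ by more than `max_gap` is an assumption, as are the function name and the default parameter values.

```python
def price_ranges(prices, max_gap=5.0, min_size=2):
    """Group sorted prices into clusters separated by gaps larger than
    `max_gap`, and return (low, high) for clusters of at least `min_size`."""
    clusters, current = [], []
    for price in sorted(prices):
        if current and price - current[-1] > max_gap:
            clusters.append(current)
            current = []
        current.append(price)
    if current:
        clusters.append(current)
    return [(cluster[0], cluster[-1]) for cluster in clusters if len(cluster) >= min_size]
```

For the prices 10, 11, 12, 50, 52, and 99 this yields two Price Ranges, (10, 12) and (50, 52); the lone price of 99 falls below the minimum cluster size, although it still sets the overall maximum.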

At block 1520, the Channel Range for the Product may be set as the minimum and maximum of the Prices of block 1510 and saved in Insight 375 records. At block 1525, the results of blocks 1510, 1515, and 1520 may be filtered by, for example, Region, Date/Time, and according to other criteria and saved in Insight 375 records. At block 1530, the Price Ranges may be determined relative to all Products in a Category 335, all Products by a Brand, and relative to a benchmark which may be, for example, the maximum number of Price Ranges within a Category 335. The result thereof may be saved as Price Ranges in Insight 375 records.

FIG. 16 is a flowchart illustrating an embodiment of Reach Module 1600. Reach Module 1600 determines the reach of an iPID or MPID in terms of the number of people who visit a sales venue. At block 1605, a Product may be obtained, such as from a user or another module. At block 1610, the Stores offering the Product for sale may be obtained. At block 1615, the traffic at the stores may be obtained, such as from a source for online webpage/website traffic, such as Alexa or similar. At block 1620, the result of block 1615 may be filtered by, for example, criteria such as Date/Time (including Season), Location (including Region), Holiday, and other criteria. The result thereof may be saved as Reach in Insight 375 records. At block 1699, the Reach Module 1600 may return.

FIG. 17 is a flowchart illustrating an embodiment of User Contact Module 1700. At block 1705, a user contact with the User Contact Module 1700 may be detected. The user contact may be part of a user-interface served by User Contact Module 1700, which user-interface allows users to input queries and see results, relative to data in Indix Datastore 300. At block 1710, a user query may be received, such as for PriceDNA records and/or Insight records. At block 1900, User Contact Module 1700 may execute Query Module 1900 to execute the query and return results to the user.

Opening loop block 1715 to closing loop block 1740 may iterate for each user query. At block 1720, a determination may be made regarding whether the user has requested that the query be stored as an alert. If so, then at block 1725 a time period for the alert may be obtained or set (such as according to a default time period, such as once per day or week). At block 1730, on occurrence of the time period of block 1725, the query may be executed, such as by execution of Query Module 1900. At block 1735, an alert or other message may be sent to contact information associated with the user. At block 1799, the User Contact Module 1700 may conclude.
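The alert handling of blocks 1725 through 1735 can be sketched as a periodic check over stored alerts. The `Alert` class, its field names, and the `execute_query` and `send` callables are hypothetical stand-ins for Query Module 1900 and for whatever messaging mechanism is used.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Alert:
    query: str            # stored user query
    contact: str          # contact information associated with the user
    period: timedelta     # how often the query is re-run
    last_run: datetime    # when the query last ran

def run_due_alerts(alerts, now, execute_query, send):
    """Execute every alert whose period has elapsed and message its owner."""
    for alert in alerts:
        if now - alert.last_run >= alert.period:
            send(alert.contact, execute_query(alert.query))
            alert.last_run = now
```

Passing the query executor and the sender as arguments keeps the scheduling logic independent of how queries are run and how messages are delivered.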

FIG. 18 is a flowchart illustrating an embodiment of Data Ingestion Module 1800. Opening loop block 1805 to closing loop block 1860 may iterate for sets of Parse Result 325 records produced by, for example, Parser Routine 700 described in U.S. patent application Ser. Nos. 14/726,707, 13/951,244, and U.S. provisional patent application No. 61/675,492 (in the present document's cross-reference to related applications). The sets of Parse Result 325 records may be sets comprising, for example, Parse Result 325 records obtained from crawls of a common parent domain name or a group of domain names which are known to be used by the same Store or Merchant.

Opening loop block 1810 to closing loop block 1825 may iterate for each Parse Result 325 set of each iPID 330 within the then-current set of Parse Result 325 records, which Parse Result 325 contains Product Attribute 345 and Price Attribute 340 records.

At block 1815, Product Attribute 345 and Price Attribute 340 records with respect to the then-current iPID 330 may be joined by access time (or by such records which occur within a narrow access time window) to form one logical record reflecting the Product Attribute 345 and Price Attribute 340 records obtained with respect to the iPID 330 at one time (or narrow access time window).
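The join of block 1815 can be sketched as pairing records whose access times fall within a tolerance. The record layout (an `accessed` epoch timestamp in seconds plus an `attrs` dictionary), the 60 second default window, and the function name are assumptions.

```python
def join_by_access_time(product_records, price_records, window_seconds=60):
    """Pair Product and Price records for one iPID whose access times fall
    within `window_seconds` of each other, forming one logical record each."""
    joined = []
    for product in product_records:
        for price in price_records:
            if abs(product["accessed"] - price["accessed"]) <= window_seconds:
                joined.append({
                    "accessed": min(product["accessed"], price["accessed"]),
                    "product": product["attrs"],
                    "price": price["attrs"],
                })
    return joined
```

The nested loop is quadratic in the number of records per iPID, which is acceptable for a sketch; sorting both lists by access time would allow a single merge pass instead.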

At block 1820, Data Ingestion Module 1800 may determine a geographic region (“geo”) associated with the iPID 330. The geo may be determined based on, for example, a country code in a URI, based on country identification in the Parse Result 325, based on currency in the Parse Result 325, or the like.
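Two of the signals named for block 1820, a country code in the URI and a currency, can be combined in a small sketch. The lookup tables below are deliberately tiny, illustrative assumptions rather than complete mappings, and the function name `determine_geo` is hypothetical.

```python
from urllib.parse import urlparse

# Illustrative, incomplete lookup tables (assumptions for this sketch).
CCTLD_TO_GEO = {"uk": "GB", "de": "DE", "fr": "FR", "in": "IN"}
CURRENCY_TO_GEO = {"USD": "US", "GBP": "GB", "INR": "IN"}

def determine_geo(uri, currency=None):
    """Prefer a country-code top-level domain; fall back to the currency."""
    host = urlparse(uri).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    if tld in CCTLD_TO_GEO:
        return CCTLD_TO_GEO[tld]
    return CURRENCY_TO_GEO.get(currency)
```

The function returns `None` when neither signal is recognized, leaving the caller to decide on a default region. A currency shared by several countries would need an additional signal to resolve.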

At closing loop block 1825, Data Ingestion Module 1800 may return to opening loop block 1810 to iterate over the next Parse Result 325 set of each iPID 330 within the then-current set of Parse Result 325 records.

At block 1830, Data Ingestion Module 1800 may group iPID 330 records according to MPID 332. If an iPID 330 was not previously selected to be an MPID 332 for a group of iPID 330 records or for an individual iPID 330 record, then the sole iPID 330 may be assigned as an MPID 332 for its own group (itself).
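The grouping of block 1830, including the fallback in which an ungrouped iPID serves as its own MPID, can be sketched as follows. The function name and the `{ipid: mpid}` mapping are hypothetical.

```python
from collections import defaultdict

def group_by_mpid(ipids, ipid_to_mpid):
    """Group iPIDs under their MPID; an iPID without an MPID forms its own group."""
    groups = defaultdict(list)
    for ipid in ipids:
        # Fall back to the iPID itself when no MPID has been assigned.
        groups[ipid_to_mpid.get(ipid, ipid)].append(ipid)
    return dict(groups)
```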

At block 1835, Data Ingestion Module 1800 may obtain one or more categories associated with the MPID 332 of block 1830, such as according to a Category 335 record stored in Indix Datastore 300, as may have been assigned by, for example, Assigner Routine 1200 as described in U.S. patent application Ser. Nos. 14/726,707 and 13/951,244 in relation to MPID.

At block 1840, based on the geo of block 1820 and the Category 335 of block 1835, Data Ingestion Module 1800 may determine a Get Store Primary 185 and/or Get Store Replica-1 186 to -N 187 and a Search-Analytics Primary 175 and/or Search-Analytics Replica-1 176 to -N 177 already associated with the geo of block 1820 and the Category 335 of block 1835. This may be obtained from, for example, Shard Map 393. If not already set, Data Ingestion Module 1800 may set a Get Store Primary 185 and/or Get Store Replica-1 186 to -N 187 and a Search-Analytics Primary 175 and/or Search-Analytics Replica-1 176 to -N 177 to associate with the geo of block 1820 and the Category 335 of block 1835. This may be recorded in, for example, Shard Map 393. Thus, Get Store Primary 185 and Replicas thereof and Search-Analytics Primary 175 and Replicas thereof (and Shard(s) within both) may both be organized based on Category 335 and geo. Periodically, the Categories and geo within the Shards may be redistributed or rebalanced.
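The lookup-or-assign behaviour described for Shard Map 393 can be sketched with a dictionary keyed by `(geo, category)`. Assigning a new key to the shard that currently holds the fewest keys is an assumption; the text does not specify an assignment policy. The function name and shard names are hypothetical.

```python
from collections import Counter

def resolve_shard(shard_map, geo, category, shard_names):
    """Return the shard for (geo, category), assigning one if none is set."""
    key = (geo, category)
    if key not in shard_map:
        # Assumed policy: pick the shard with the fewest assigned keys.
        load = Counter(shard_map.values())
        shard_map[key] = min(shard_names, key=lambda name: load[name])
    return shard_map[key]
```

Because the assignment is written back into `shard_map`, later records with the same geo and Category resolve to the same shard until the map is rebalanced.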

At block 1845, Data Ingestion Module 1800 may store G-S Records in the Get Store repositories identified in block 1840. For example, the following types of records may be stored in Get Store Primary 185 and/or Get Store Replica-1 186 to -N 187: granular, highly detailed, complete current Price Attribute 340 and Product Attribute 345 records, including iPID 330, URI 305, image-URIs, and the like. New or updated Price Attribute 340 records associated with an iPID 330 may be stored in the Get Store repositories identified in block 1840, while historic records previously in the Get Store repositories may be moved to, for example, History Primary 188 (each iPID 330 may be associated with a set of Price Attribute 340 records, a current record and historic records). With respect to Product Attribute 345 records, Data Ingestion Module 1800 may merge the most recent Product Attribute 345 record into a Product Attribute 345 record associated with each iPID 330 (each iPID 330 may be associated with one Product Attribute 345 record). In this merger, new values overwrite old values unless the old record is longer or unless the old record otherwise is judged to be of higher quality (such as if the old record uses fewer words, but the words are less common than the words in the new record); if a new record does not have a value where an old value exists, the old value may be left.
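The merge rule for Product Attribute records can be sketched field by field. Only the length test is implemented here; the word-rarity quality judgment mentioned above is omitted. Applying the rule per field rather than per record, and the function name, are assumptions.

```python
def merge_attributes(old: dict, new: dict) -> dict:
    """Merge a newer Product Attribute record into the stored one.

    A new value replaces the old one unless the old value is longer;
    missing or empty new values leave the old value in place.
    """
    merged = dict(old)
    for field, value in new.items():
        if not value:
            continue  # nothing new to record; keep the old value
        previous = old.get(field)
        if previous is None or len(str(value)) >= len(str(previous)):
            merged[field] = value
    return merged
```

A quality score based on word frequency could replace the length comparison without changing the surrounding structure.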

If not already performed, at block 2000, Data Ingestion Module 1800 may execute Get Store Index Module 2000, to index the records stored in Get Store Primary 185 and/or Get Store Replica-1 186 to -N 187.

At block 500, Data Ingestion Module 1800 performs Core Price Module 500 (described further herein), utilizing the updated records in Get Store repositories.

At block 600, Data Ingestion Module 1800 performs the Insight Module 600 utilizing and expanding upon the output of the Core Price Module 500. Generally, Insight Module 600 identifies what Product Attributes 345 and Price Attributes 340 across the datasets are associated with the changes in price.

At block 1850, Data Ingestion Module 1800 stores the output of Core Price Module 500 and of Insight Module 600 in the Indix Datastore 300 as Insight 375 in the S-A repositories identified in block 1840. For example, the following types of records may be stored in Search-Analytics Primary 175 and/or Search-Analytics Replica-1 176 to -N 177: Product Title 395 tokens, the output of Core Price Module 500 (such as Core Price 380 records), the output of Insight Module 600 (Insight 375 records), MPID 332 and iPID 330 record values (which may be used to find corresponding records in Get Store Primary 185 and/or Get Store Replica-1 186 to -N 187), and other records which are not granular or detailed in nature, but are the result of processing granular records (such as Price Attribute 340 and Product Attribute 345 records). Historic records, such as historic Core Price 380 and Insight 375 records, may be moved to History Primary 188 and/or History Replica-1 189.

If not already performed, at block 2100, Data Ingestion Module 1800 may execute Search and Analytics Index Module 2100, to index the records stored in Search-Analytics Primary 175 and/or Search-Analytics Replica-1 176 to -N 177.

At block 1855, Shard Map 393 may be updated, for example, to reflect changes to the Primary/Shard structure. For example, as records are stored in and moved between data stores, existing Shards may be combined and/or new Shards may be created to accommodate changes in the number of records associated with Category 335 and geo. Other changes in the Primary and Shard structure may also be implemented, which may require updates to Shard Map 393.

At closing loop block 1860, Data Ingestion Module 1800 may return to opening loop block 1805 to iterate over the next set of Parse Result 325 records, if any.

At concluding block 1899, Data Ingestion Module 1800 may conclude and/or return to a process which may have spawned it.

FIG. 19 is a flowchart illustrating an embodiment of Query Module 1900. Query Module 1900 may be executed with respect to user queries. A user query may comprise a “free text” search, which may be converted to a Boolean text search with an inferred relationship with specific records in Indix Datastore 300. A user query may also be constructed by entering values into structured search fields, which search fields are tied to specific records in Indix Datastore 300. Certain of the records in the Indix Datastore 300 are highly structured, with many hierarchical relationships. For example, a group of iPIDs 330 for a common Product will have a common MPID 332. An MPID 332 can be used to identify the iPIDs 330 in the MPID 332. Merchants have Stores. These and many other hierarchical relationships exist among the data records in the Indix Datastore 300. In addition, many of the records in the Indix Datastore 300 have high cardinality. Cardinality generally describes the number of unique elements in a set. A set comprising a large number of different or unique elements has high cardinality. An example of a high cardinality record set is one comprising UPCs, SKUs, MPNs, or URIs (iPID 330 may be derived from a URI, and this derivation may be reversible). Many traditional search engines have a very flat data model, without hierarchical relationships; many traditional search engines do not work well with record sets with high cardinality. Query Module 1900 must be able to rapidly and efficiently search records which have hierarchical relationships and with respect to queries which may or may not be addressed to record sets which have high cardinality. Modules which feed data into these record sets (such as Get Store Index Module 2000, Core Price Module 500, Insight Module 600, and Search and Analytics Index Module 2100) must be able to operate very rapidly, so as to allow a user to input websites to be crawled, have them crawled, and have Parse Results 325 and data therefrom be available within seconds.

Opening loop block 1901 to closing loop block 1960 may iterate for each user query.

At block 1905, Query Module 1900 may identify a high cardinality Lookup Key in the user query, such as if the user query contains a field which is identifiable as or is identified as a UPC, SKU, MPN, or a URI. The presence of a Lookup Key is not required, but, if present, should be addressed to shorten the search process.

At block 1910, Query Module 1900 may generate a hash of the Lookup Key in the query. At block 1915, Query Module 1900 may compare the hash of block 1910 with Lookup Key Hash 385 records (previously generated during execution of Get Store Index Module 2000) and determine, via this comparison, which G-S Shard has the highest percentage overlap with the Lookup Key in the user query.
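By way of illustration only, blocks 1910 and 1915 may be sketched as follows. The sketch assumes that a query may carry more than one Lookup Key, that "percentage overlap" means the fraction of hashed query keys found among a Shard's Lookup Key Hash 385 records, and that SHA-256 serves as a stand-in hash; none of these choices is stated in the disclosure, and the function names are hypothetical.

```python
import hashlib
from typing import Dict, Iterable, List, Set, Tuple


def hash_key(value: str) -> str:
    """Hash a Lookup Key after light normalization (assumed scheme)."""
    return hashlib.sha256(value.strip().lower().encode("utf-8")).hexdigest()


def rank_shards(
    query_keys: Iterable[str],
    shard_hashes: Dict[str, Set[str]],
) -> List[Tuple[str, float]]:
    """Rank G-S Shards by the fraction of query key hashes they contain."""
    hashed = {hash_key(key) for key in query_keys}
    if not hashed:
        return []  # no Lookup Key in the query: nothing to compare
    scores = []
    for shard, hashes in shard_hashes.items():
        overlap = len(hashed & hashes) / len(hashed)
        if overlap > 0:
            scores.append((shard, overlap))
    # Highest overlap first; shard name breaks ties deterministically.
    scores.sort(key=lambda item: (-item[1], item[0]))
    return scores
```

Because only hashes are compared, a Shard can be selected without reading any underlying Get Store record.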

At block 1920, Query Module 1900 may identify the Primary and/or Replica(s) associated with the G-S Shard(s) identified in block 1915. Because the G-S Shards and S-A Shards may both be organized by Category 335 and geo (though the memory in each contains different types of records), they have a similar indexing structure, so identification of which G-S Shard has the highest percentage overlap with the Lookup Key in the user query may also be used to identify which S-A Shard(s) are likely to be responsive to the user query.

At block 1906, Query Module 1900, such as utilizing Query Augmentation Handler 171 and Query Augmentation Module 260, may weight components of the query for trending terms in popular culture and may determine one or more Category 335 for the query. For example, the term “apple” may refer to a computer, a line of fashionable clothes, or to a type of fruit. Query Augmentation Module 260 may detect that the “apple” clothing line is trending on other search engines, in social media, and the like and may associate the query with a “clothing” Category 335. Block 1906 may further identify a geographic area or “geo” associated with the query. This may be based on a location provided by, for example, the user in the query (“red Nike shoes in Seattle”) or in account or other information provided by the user, or it may be obtained from an IP address or other location information obtained in conjunction with the query.
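By way of illustration only, the trend weighting of block 1906 may be sketched as a multiplication of baseline Category weights by trend factors. The weights, the multiplicative scheme, and the function name are assumptions for this sketch; the disclosure does not state how Query Augmentation Module 260 combines trend signals.

```python
from typing import Dict


def pick_category(
    term_categories: Dict[str, float],
    trend_boost: Dict[str, float],
) -> str:
    """Choose the Category with the highest trend-adjusted weight."""
    scored = {
        category: weight * trend_boost.get(category, 1.0)
        for category, weight in term_categories.items()
    }
    # Iterating in sorted order makes ties resolve alphabetically.
    return max(sorted(scored), key=scored.__getitem__)
```

With hypothetical baseline weights for “apple” of 0.5 for computers, 0.3 for fruit, and 0.2 for clothing, a trend factor of 3.0 on clothing would shift the chosen Category from computers to clothing.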

At block 1907, Query Module 1900 may identify the S-A Shard(s) which have an overlap with or which overlap the most with the Category 335 and geo identified in block 1906 (the Shard(s) being organized by Category and geo by Data Ingestion Module 1800).

At block 1908, Query Module 1900 may further resolve the query to S-A Shard(s), such as within the group of S-A Shard(s) identified in block 1907, according to Inverted Bitmap Index 394. Inverted Bitmap Index 394 may have been created by, for example, Search and Analytics Index Module 2100, which may be a subroutine of Data Ingestion Module 1800. Inverted Bitmap Index 394 may comprise, for example, an inverted bitmap index of significant terms such as Brand, Category, Store, geography and Product Title 395 components, which terms may be of low cardinality. Resolving the query to S-A Shard(s) according to Inverted Bitmap Index 394 may return a list of S-A Shards, ranked according to highest correlation with the query.
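By way of illustration only, the ranking of block 1908 may be sketched with one bitmap per term in which bit i is set when S-A Shard i contains that term. Using the count of matching query terms as the measure of "correlation" is an assumption for this sketch, as are the function and parameter names.

```python
from typing import Dict, List, Tuple


def rank_sa_shards(
    query_terms: List[str],
    term_to_shard_bits: Dict[str, int],
    shard_count: int,
) -> List[Tuple[int, int]]:
    """Return (shard number, matched term count), best match first."""
    hits = [0] * shard_count
    for term in query_terms:
        bits = term_to_shard_bits.get(term.lower(), 0)
        for shard in range(shard_count):
            if (bits >> shard) & 1:  # bit set: this shard holds the term
                hits[shard] += 1
    ranked = [(shard, count) for shard, count in enumerate(hits) if count > 0]
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked
```

Because the indexed terms are of low cardinality, each bitmap stays small and the lookup touches only as many bitmaps as there are query terms.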

At block 1922, the Primary(ies) for the S-A Shards of block 1920 and 1908 may send the query to the identified Replicas, to be executed relative to the identified Shards.

Opening loop block 1925 to closing loop block 1940 may iterate for each S-A Replica identified in block 1920 and block 1922. If no Replica was identified, then the search may take place at the Primary.

At block 1930, the Replica may execute the query using the Lookup Key of block 1905 (or a hash thereof) and the Inverted Bitmap Index 394 values of block 1908. In addition, if the query contains or refers to an Insight 375 record, which Insight 375 was stored in RAM for rapid search by Search and Analytics Index Module 2100, Query Module 1900 may also filter the S-A Shard(s) according to such filterable Insight 375 record components.

The result will be a subset of MPID 332 records in the S-A Primary and/or Replica(s), which subset of MPID 332 records will refer to iPID 330 records.

For example, a search of a SKU number of a shoe may, at block 1915, identify a set of G-S Shards containing the SKU and, at block 1920, a set of corresponding S-A Shards. The query may additionally or alternatively contain a Store which may be used at block 1908 in the Inverted Bitmap Index 394 to resolve the query to S-A Shards. The query may also request Substitutes for the SKU number at the Store, which may be a filterable value in an Insight 375 record which may be stored in RAM in the S-A Shard. The Lookup Key, Inverted Bitmap Index values, and/or filterable Insight 375 record values may be used at block 1930 to further refine the set of MPID 332 results.

At optional block 1932, queries addressed to fields in Indix Datastore 300 which are not available for real time searching, such as queries addressed to fields not stored in RAM, may be executed at a slower rate. In such cases, a notification may be sent to the user when the results are available to be viewed.

The result may be a set of MPID 332 records in the filtered S-A Shards, which MPID 332 records link to iPID 330 records in the G-S Shards.

Closing loop block 1940 may return to opening loop block 1925 to iterate over the next S-A Replica identified in block 1920 and block 1922.

At block 1945, the results from more than one Replica (if used) may be collected and aggregated.
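By way of illustration only, the collection and aggregation of block 1945 may be sketched as a merge of per-Replica result lists. The sketch assumes that each Replica returns (MPID, score) pairs and that duplicate MPIDs are resolved by keeping the highest score; the disclosure does not describe the aggregation rule, and the function name is hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def aggregate_replica_results(
    replica_results: Iterable[Iterable[Tuple[str, float]]],
) -> List[Tuple[str, float]]:
    """Merge per-Replica (MPID, score) lists into one ranked list."""
    best: Dict[str, float] = {}
    for results in replica_results:
        for mpid, score in results:
            if mpid not in best or score > best[mpid]:
                best[mpid] = score  # keep the strongest score per MPID
    # Highest score first; MPID breaks ties deterministically.
    return sorted(best.items(), key=lambda pair: (-pair[1], pair[0]))
```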

At block 1950, Query Module 1900 may get G-S Shard records corresponding to the ranked MPID 332 results of blocks, such as according to the subset of iPID 330 records of block 1930 and/or 1935. The iPID 330 records may be used to allow the user to see search results specific to a specific iPID 330 and URI 305, including a specific product image, webpage snapshot, Product Attributes 345 and Price Attributes 340 obtained in Parse Results 325 from the webpage.

At block 1955, Query Module 1900 may return the results to the user and/or to a software application which is servicing the user, such as User Contact Module 1700.

At closing loop block 1960, Query Module 1900 may return to opening loop block 1901 to iterate over the next user query, if any.

At concluding block 1999, Query Module 1900 may conclude or may return to a process which spawned it.

FIG. 20 is a flowchart illustrating an embodiment of Get Store Index Module 2000. Get Store Index Module 2000 may be executed to index records stored in Get Store Primary 185, Replicas thereof, and Shards thereof, such as G-S Shard 1 140 to G-S Shard N 143 (which may be referred to as “G-S Shard”).

Opening loop block 2005 to closing loop block 2045 may iterate for each G-S Shard into which a record is to be stored. Opening loop block 2010 to closing loop block 2040 may iterate for each MPID 332 in each record which is to be stored. Opening loop block 2015 to closing loop block 2035 may iterate for each iPID 330 in each MPID 332 in each record which is to be stored.

At block 2020, Get Store Index Module 2000 may get the values of index keys in the record. The index keys may be high cardinality, frequently searched terms and/or combinations of terms. Examples of such index keys include SKU plus store identifier, UPC number, MPN plus Brand identifier, and URI (from which iPID 330 may be reversibly derived; in other words, it may be possible to determine iPID 330 from a URI).

At block 2025, Get Store Index Module 2000 may individually hash the index key values. The hash process may be lossless. At block 2030, the hashed index key values may be stored as Lookup Key Hash 385 records. Indexing of these records in this manner allows rapid identification and retrieval of corresponding records, as the corresponding webpages are crawled, providing essentially “live” search results, notwithstanding the very large number of webpages for which data is obtained. Index records may generally be stored or grouped as Index 370 records.

At block 2035, Get Store Index Module 2000 may return to block 2015 to iterate over the next iPID 330 in each MPID 332 in each record which is to be stored. At block 2040, Get Store Index Module 2000 may return to block 2010 to iterate over the next MPID 332 in each record which is to be stored. At block 2045, Get Store Index Module 2000 may return to block 2005 to iterate over the next G-S Shard into which each record is to be stored.
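By way of illustration only, the nested iteration of blocks 2005 through 2045 may be sketched as follows. The record layout (dictionaries with `sku`, `store`, `upc`, `mpn`, `brand`, and `uri` fields), the key formats, and the function names are assumptions for this sketch. SHA-256 is used only as a convenient stand-in; it is a one-way hash and therefore does not model the lossless hash contemplated above.

```python
import hashlib
from typing import Dict, List, Set


def index_keys_for(record: Dict[str, str]) -> List[str]:
    """Block 2020: collect the high cardinality index keys of one iPID record."""
    keys = []
    if "sku" in record and "store" in record:
        keys.append(f"sku:{record['sku']}|store:{record['store']}")
    if "upc" in record:
        keys.append(f"upc:{record['upc']}")
    if "mpn" in record and "brand" in record:
        keys.append(f"mpn:{record['mpn']}|brand:{record['brand']}")
    if "uri" in record:
        keys.append(f"uri:{record['uri']}")
    return keys


def build_lookup_key_hashes(
    shards: Dict[str, Dict[str, List[Dict[str, str]]]],
) -> Dict[str, Set[str]]:
    """Blocks 2005-2045: hash each index key, grouped by G-S Shard."""
    lookup: Dict[str, Set[str]] = {}
    for shard, mpids in shards.items():              # each G-S Shard
        hashes = lookup.setdefault(shard, set())
        for _mpid, ipid_records in mpids.items():    # each MPID
            for record in ipid_records:              # each iPID
                for key in index_keys_for(record):
                    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
                    hashes.add(digest)               # block 2030: store hash
    return lookup
```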

At concluding block 2099, Get Store Index Module 2000 may conclude or return to a process which may have spawned it.

FIG. 21 is a flowchart illustrating an embodiment of Search and Analytics Index Module 2100. Search and Analytics Index Module 2100 indexes records stored in Search-Analytics Primary 175 and/or Search-Analytics Replica-1 176 to -N 177, allowing for rapid search and identification of such records.

Opening loop block 2105 to closing loop block 2125 may iterate for each record stored in each S-A Shard 1 136 to S-A Shard N 137 (“S-A Shard”), such as during execution of Data Ingestion Module 1800.

At block 2110, Search and Analytics Index Module 2100 may update and/or create an inverted bitmap index for low cardinality terms which are nonetheless significant and/or which are used frequently in queries. Examples include Brand, Category (which may be a Category 335 record), Store, and geographic identifier (such as an address, region, city, state, or other geographic area or identifier). At block 2110, Search and Analytics Index Module 2100 may also include in the inverted bitmap index an index of the value of tokens in a Product Title for the record. As used herein, a Product Title 395 record may be obtained from a Parse Result 325 from a URI 305, and is generally given or provided by a Merchant on a webpage. Product Title 395 records generally comprise a subset of Product Attributes 345, such as the Product Attribute 345 types which occur most commonly within a Category 335 and/or within an MPID 332. The inverted bitmap index may be stored as Inverted Bitmap Index 394, which may be referenced by or grouped in Index 370.
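By way of illustration only, the inverted bitmap index of block 2110 may be sketched with one integer bitmap per term, in which bit n is set when record n of the S-A Shard contains the term. The record fields, the `field:value` term format, whitespace tokenization of the title, and the function names are assumptions for this sketch.

```python
from typing import Dict, List


def build_inverted_bitmap_index(records: List[Dict[str, str]]) -> Dict[str, int]:
    """Map each low cardinality term and title token to a record bitmap."""
    index: Dict[str, int] = {}
    for position, record in enumerate(records):
        terms = set()
        for field in ("brand", "category", "store", "geo"):
            if record.get(field):
                terms.add(f"{field}:{record[field].lower()}")
        for token in record.get("title", "").lower().split():
            terms.add(f"title:{token}")
        for term in terms:
            index[term] = index.get(term, 0) | (1 << position)
    return index


def match_all(index: Dict[str, int], terms: List[str]) -> List[int]:
    """Return positions of records containing every term (bitwise AND)."""
    if not terms:
        return []
    bits = index.get(terms[0], 0)
    for term in terms[1:]:
        bits &= index.get(term, 0)
    positions = []
    position = 0
    while bits:
        if bits & 1:
            positions.append(position)
        bits >>= 1
        position += 1
    return positions
```

Combining terms reduces to a bitwise AND over a handful of bitmaps, which is why such an index suits terms that repeat across many records.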

At block 2115, those Insight 375 records which are available to be queried by users may be stored in RAM for the S-A Shard, such as in RAM 251, while other records may be stored in SSD, such as in SSD 253. Insight 375 records include the output of Volatility Module 700, Substitution Module 800, Mix Module 900, Prediction Module 1000, Competition Module 1100, Promotion Module 1200, Leadership Module 1300, Premium Module 1400, Price Range Module 1500, and Reach Module 1600. The output of such modules may be available to be searched by users; a subset of this output may be available for rapid searching and may be stored in RAM 251 for the S-A Shard.

At closing loop block 2125, Search and Analytics Index Module 2100 may return to opening loop block 2105 to iterate over the next record stored in each S-A Shard.

The above Detailed Description of embodiments is not intended to be exhaustive or to limit the disclosure to the precise form disclosed above. While specific embodiments and examples are described above for illustrative purposes, various equivalent modifications are possible within the scope of the system, as those skilled in the art will recognize. For example, while processes or blocks are presented in a given order, alternative embodiments may perform modules having operations, or employ systems having blocks, in a different order, and some processes or blocks may be deleted, moved, added, subdivided, combined, and/or modified. While processes or blocks are at times shown as being performed in series, these processes or blocks may instead be performed in parallel, or may be performed at different times. Further, any specific numbers noted herein are only examples; alternative implementations may employ differing values or ranges.

1. A computer implemented method of storing information and searching the stored information in close-to realtime, the method comprising: at a first computer comprising a processor and a memory, which memory comprises: an attribute datastore for storing price and product attributes for a set of products, which price and product attributes are obtained from webpages accessed via Uniform Resource Identifiers (“URIs”), a history datastore for storing historical price and product attributes for the set of products, and an analytics datastore for storing the result of an analysis of the price and product attributes in the attribute datastore and the history datastore; at the first computer receiving a set of price and product attributes obtained from a website for a first product, which attributes comprise a first category in a category taxonomy; at the first computer determining a geographic area of the website; at the first computer determining a replica of the attribute datastore and a replica of the analytics datastore, based on the geographic area of the website and the first category; by the first computer storing the price and product attributes in the determined replica of the attribute datastore; performing the analysis and storing the result of the analysis in the determined replica of the analytics datastore; with respect to the price and product attributes stored in the determined replica of the attribute datastore, obtaining a set of values of high cardinality entries in the price and product attributes and hashing each such value to form a set of index key hash values; forming an inverted bitmap index of a subset of the result of the analysis of the price and product attributes stored in the determined replica of the analytics datastore; receiving a query; hashing a high cardinality search term in the query, if any, and comparing the hashed high cardinality search term with the set of index key hash values to determine that the replica of the attribute datastore comprises a set of records responsive to the query or, if the query does not comprise a high cardinality search term, searching for terms in the query according to the inverted bitmap index to identify the corresponding replica in the analytics datastore and the set of records responsive to the query therein; in response to the query, returning the set of records responsive to the query.

2-23. (canceled)